[AutoParallel] Visualize flow parallel timing diagram in static graph mode #58313
Conversation
Your PR was submitted successfully. Thank you for contributing to the open-source project!
@@ -38,6 +38,10 @@
 #include "paddle/fluid/platform/device_event.h"
 #include "paddle/phi/backends/device_manager.h"

+#if defined(PADDLE_WITH_CUDA)
+#include "paddle/phi/kernels/autotune/gpu_timer.h"
gpu_timer is related only to the concrete implementation of the class interface, not to the class definition. It should be included only in the .cc files that actually use it, not in the base-class header.
@@ -103,6 +104,16 @@ ProgramInterpreter::~ProgramInterpreter() {
 }

+void ProgramInterpreter::RunImpl() {
+#if defined(PADDLE_WITH_CUDA)
+  if (FLAGS_auto_parallel_profiler) {
+    // Note(sonder): Record the start time of the each stream.
A NOTE is usually reserved for explaining complex, hard-to-read code, or for conveying information the code itself cannot express. These lines are simple and direct, and this NOTE merely restates the code, so it can be dropped.
#if defined(PADDLE_WITH_CUDA) || defined(PADDLE_WITH_HIP)
  stream_timers_.clear();
  std::vector<gpuStream_t> streams;
  bool has_default_stream = false;
The Paddle framework never uses the null (default) stream, so there is no need to handle that case.
void Start() {
  struct timeval time_now {};
  gettimeofday(&time_now, nullptr);
  start_time_ = (time_now.tv_sec * 1000) + (time_now.tv_usec / 1000.0);
A comment could be added here explaining why CPU time is needed as start_time.
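As a side note on the `Start()` snippet above, the millisecond conversion it performs can be sketched as follows. This is only an illustration of the arithmetic in the diff, with a hedged guess (marked as an assumption in the comments) about why a host-side clock is used at all:

```python
# Sketch of the millisecond conversion in Start() above, assuming
# gettimeofday() semantics: tv_sec in seconds, tv_usec in microseconds.
def to_milliseconds(tv_sec: int, tv_usec: int) -> float:
    # Mirrors the C++ expression:
    # (time_now.tv_sec * 1000) + (time_now.tv_usec / 1000.0)
    return (tv_sec * 1000) + (tv_usec / 1000.0)

# Assumption (not stated in the PR): a host-side (CPU) wall-clock start
# time gives every stream a common reference point, so per-stream GPU
# event offsets can be aligned onto one shared timeline.
print(to_milliseconds(2, 500_000))  # 2 s + 0.5 s -> 2500.0 ms
```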
double start_time, end_time;
std::tie(start_time, end_time) =
    interpretercores_[job_idx]->InterpreterRunTime();
VLOG(0) << "Profiler Info: Job (" << job_idx << "), type = " << job_type
Add a comment here explaining what this log is for; otherwise someone unfamiliar with it might change it by mistake.
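The reviewer's concern makes sense because, per the description, profiler_helper_static.py regex-matches GLOG output for exactly this VLOG(0) line. A minimal sketch of such parsing is below; only the prefix "Profiler Info: Job (<idx>), type = <type>" appears in the diff, so any further fields are deliberately omitted rather than guessed, and `parse_profiler_line` is a hypothetical helper name:

```python
import re

# Match the VLOG(0) message emitted above; the pattern covers only the
# portion of the format visible in the diff.
PATTERN = re.compile(r"Profiler Info: Job \((\d+)\), type = (\w+)")

def parse_profiler_line(line: str):
    """Return (job_idx, job_type) if the line is a profiler record, else None."""
    m = PATTERN.search(line)
    if m is None:
        return None
    return int(m.group(1)), m.group(2)

# GLOG lines carry a prefix (timestamp, file:line) before the message.
print(parse_profiler_line("I1030 12:00:00.0 1 interp.cc:1 Profiler Info: Job (3), type = backward"))
```

This also illustrates why the description recommends `GLOG_v=0`: the less unrelated log output there is, the less text this kind of regex scan has to walk through.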
@@ -0,0 +1,117 @@
+# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
This script could be placed under distributed/auto_parallel/static/.
-      const std::vector<std::string>& feed_names, bool need_fetch = true) = 0;
+      const std::vector<std::string>& feed_names,
+      bool need_fetch = true,
+      bool enable_auto_parallel_profiler = false) = 0;
Suggested change:
- bool enable_auto_parallel_profiler = false) = 0;
+ bool enable_job_schedule_profiler = false) = 0;
done
@@ -34,6 +34,10 @@ PADDLE_DEFINE_EXPORTED_bool(new_executor_use_local_scope,
                             true,
                             "Use local_scope in new executor(especially used "
                             "in UT), can turn off for better performance");
+PADDLE_DEFINE_EXPORTED_bool(auto_parallel_profiler,
Why is this FLAG still needed?
Removed.
enable_auto_parallel_profiler_ = enable_auto_parallel_profiler;

if (enable_auto_parallel_profiler_) {
#if defined(PADDLE_WITH_CUDA) || defined(PADDLE_WITH_HIP)
The compile-guard macro should wrap the conditional from the outside; otherwise, when the macro condition is not met, you end up with odd code like:

if (enable_auto_parallel_profiler_) {
  // empty
}
> The compile-guard macro should wrap the conditional from the outside; otherwise, when the macro condition is not met, you end up with odd code like `if (enable_auto_parallel_profiler_) { /* empty */ }`.

Fixed.
if (enable_auto_parallel_profiler_) {
#if defined(PADDLE_WITH_CUDA) || defined(PADDLE_WITH_HIP)
  gpuStream_t calculated_stream =
Is it necessary to fetch and set the same compute stream on every run? Could CalculateStreamTimer obtain the compute stream internally at construction time, so that external callers do not need to set it?
> Is it necessary to fetch and set the same compute stream on every run? Could CalculateStreamTimer obtain the compute stream internally at construction time, so that external callers do not need to set it?

Fixed: place_ is now passed in at construction time, and the compute stream is created internally.
@@ -211,6 +219,12 @@ class ProgramInterpreter : public InterpreterBaseImpl {
   InstructionSchedulingPriorityLess instruction_scheduling_priority_less;

   std::vector<HookFunc> hookfuncs_;

+#if defined(PADDLE_WITH_CUDA) || defined(PADDLE_WITH_HIP)
+  phi::CalculatedStreamTimer calculated_stream_timer_;
Suggested change:
- phi::CalculatedStreamTimer calculated_stream_timer_;
+ phi::CalculatedStreamTimer calculate_stream_timer_;
done
#if defined(PADDLE_WITH_CUDA) || defined(PADDLE_WITH_HIP)
  phi::CalculatedStreamTimer calculated_stream_timer_;
#endif
  size_t last_calculated_instr_id;
Suggested change:
- size_t last_calculated_instr_id;
+ size_t last_calculate_instr_id_;
done
@@ -1040,6 +1063,15 @@ void ProgramInterpreter::RunInstruction(const Instruction& instr_node) {

   try {
     instr_node.WaitEvent(place_);
+    if (enable_auto_parallel_profiler_) {
+#if defined(PADDLE_WITH_CUDA) || defined(PADDLE_WITH_HIP)
+      if (!interpreter::IsCommunicationOp(instr_node) &&
`!calculated_stream_timer_.IsStarted()` is just a simple flag check, and for most operators it evaluates to false, whereas `!interpreter::IsCommunicationOp(instr_node)` involves a fair amount of branching logic. In this situation, `!calculated_stream_timer_.IsStarted()` should be the first operand of the `&&` expression, so that C++ short-circuit evaluation reduces the number of actual calls to `!interpreter::IsCommunicationOp(instr_node)` and improves performance.
done
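The reordering the reviewer asks for relies on short-circuit evaluation: if the first operand of `&&`/`and` is false, the second is never evaluated. A small illustration (with hypothetical stand-in names, not the actual PR code):

```python
# Demonstrate the short-circuit ordering: put the cheap, usually-false
# flag check first so the expensive check rarely executes.
calls = {"expensive": 0}

def is_communication_op(instr) -> bool:
    """Stand-in for the costly check with lots of branching logic."""
    calls["expensive"] += 1
    return instr.get("comm", False)

def timer_is_started() -> bool:
    """Stand-in for the cheap flag check; true for most operators."""
    return True

instrs = [{"comm": False} for _ in range(100)]
for instr in instrs:
    # Cheap check first: it is False here, so `and` short-circuits and
    # is_communication_op() is never called.
    if (not timer_is_started()) and (not is_communication_op(instr)):
        pass

print(calls["expensive"])  # 0: the expensive check was skipped every time
```

With the operands in the opposite order, `is_communication_op` would run once per instruction, which is exactly the overhead the review comment is trying to avoid.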
@@ -114,6 +114,7 @@ def set_field_default_config(category, field, default_value):
 set_field_default_config(PIPELINE, "accumulate_steps", 1)
 set_field_default_config(PIPELINE, "generation_batch_size", 1)
 set_field_default_config(PIPELINE, "enable_send_recv_overlap", False)
+set_field_default_config(PIPELINE, "schedule_profiler", False)
It looks like this switch can only enable or disable the profiler, with no way to specify a sampling interval. Could we support setting pipeline.schedule_profiler_start and pipeline.schedule_profiler_end directly? The default [-1, -1) would mean disabled; otherwise the profiler runs within [start, end), and the whole job exits after step end - 1.
> It looks like this switch can only enable or disable the profiler, with no way to specify a sampling interval. Could we support setting pipeline.schedule_profiler_start and pipeline.schedule_profiler_end directly? The default [-1, -1) would mean disabled; otherwise the profiler runs within [start, end), and the whole job exits after step end - 1.

There is already code that controls the sampling interval by combining Profiler_auto.nvprof_start and Profiler_auto.nvprof_end, in the PaddleNLP PR:
LGTM
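The [start, end) sampling window proposed in the review can be sketched in a few lines. Note the config names follow the review comment (schedule_profiler_start/schedule_profiler_end), not necessarily the API the PR finally adopted:

```python
# Minimal sketch of the proposed sampling window: the default [-1, -1)
# means "profiler disabled"; otherwise profile steps in [start, end),
# with the run expected to stop after step end - 1.
def profiler_enabled(step: int, start: int = -1, end: int = -1) -> bool:
    if start < 0 or end < 0:
        return False  # default [-1, -1): never enabled
    return start <= step < end

# Steps 2 and 3 fall inside the half-open window [2, 4).
print([s for s in range(6) if profiler_enabled(s, start=2, end=4)])  # [2, 3]
```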
… graph mode (PaddlePaddle#58313)

* merge from openvino master
* add InterpreterRunTime() to record interpreter's run time
* add profiler helper static to produce json file
* add color map and support perfetto format
* recover codes
* control include env for gpu_timer.h
* fix logic for profiler_helper_static.py
* fix build error
* fix build error
* recover thirdparty
* add flag control: not support new ir now
* set auto_parallel_profiler flag to false
* fix
* add auto_parallel_profiler as command parameter
* fix value name
* support gettimeofday for win env
* fix win build error
* fix win build error
* use job_type_to_id
* Fixed repeatedly timing the same stream
* add step line for timeline
* add step timeline and fix logic when job overlap
* update time record logic
* fix bug when start profile start from none zero step
* fix note
* remove FLAGS_auto_parallel_profiler
* use run config instead FLAGS_auto_parallelxx
* fix color map logic
* fix color map logic
* fix bug when log step does not start from 0
* fix
* fix
* don't use set_enable_auto_parallel_profiler
* fix bug
* disable auto_parallel_profiler when not open flag by command line
* fix bug
* remove resettime
* fix build bug
* fix
* remove set enable
* fix build error
* fix build error
* fix build error
* fix ci error
* fix
* fix run error
* fix
* fix
* fix calculate_stream_timer logic
* remove fluid head
* fix build error
* set default value for enable_job_schedule_profiler
PR types
Others
PR changes
Others
Description
Visualize the pipeline-parallel timing diagram in static graph mode.
In static graph mode, auto-parallel execution calls StandaloneExecutor::Run on the C++ side, which runs the pre-split Jobs in order. The main goal of this PR is to visualize the execution timeline of the Jobs on different devices and inspect it with Chrome::tracing.

How to use?
The following walks through using the test_pipeline_scheduler unit test to generate log files and produce the visualized timeline. Since the unit test clears the generated log files by default, we first need to remove the log-clearing logic and specify a log directory:

1. With the FLAG enabled, run the training process and generate logs. GLOG_v=0 keeps the log output as small as possible, reducing regex-matching time.
2. Run profiler_helper_static.py to generate the json file.
3. Open the json file with Chrome Tracing.

pipeline_profile_perfetto.json can also be opened with perfetto.

Related PRs: